April 28, 2015

Problem Statement

Trans-dimensional problems

Primary Goal

To solve problems where "…the number of things you don’t know is one of the things you don’t know!"

What is RJMCMC?

RJMCMC makes you wanna…

Reversible Chain

  • A reversible Markov chain is one which the transition distribution \(P\) satisfies the detailed balance condition.
  • The detailed balance condition states for all Borel sets \(A, B \subset \mathcal{X}\):

    \[ \int_{(x, x') \in A \times B} \pi(dx) P(x, dx') = \int_{(x, x') \in A \times B} \pi(dx') P(x', dx) \]

For Model Selection

Scenario:

  • Have a countable collection of models: \(\{\mathcal{M}_k: k \in \mathcal{M}\}\)
  • Each model \(\mathcal{M}_k\) has unknown parameter vector \(\boldsymbol \theta_k\)
  • Length of parameter vector may vary from model to model
  • Notation: \((k, \boldsymbol \theta_k)\) refers to model \(k\) and its parameter vector
  • Notation: \(\pi(\theta_{k})\) is the target distribution

Algorithm (model selection)

If the current state of the chain is \((k, \boldsymbol \theta_k)\), then:

  1. Propose a new model \(\mathcal{M}_{k^*}\) with probability \(j(k^*| k)\).
  2. Generate \(\boldsymbol u\) from a specified proposal density \(g(\boldsymbol u)\)
  3. Set \((\boldsymbol \theta_{k^*}^*, \boldsymbol u^*) = h'(\boldsymbol \theta_k, \boldsymbol u)\) where \(h'\) is a bijection between \((\boldsymbol \theta_k, \boldsymbol u)\) and \((\boldsymbol \theta^*_{^*}, \boldsymbol u^*)\) where the following must hold: \(dim(\boldsymbol \theta_k) + dim(\boldsymbol u) = dim(\boldsymbol \theta^*_{k^*}) + dim(\boldsymbol u^*)\).
  4. Accept the proposed move to \((k^*, \boldsymbol \theta_{k^*}^*)\) with probability

    \[ \alpha = \min\left\{1, \frac{\pi( \theta_{k^*}^*) j(k^*| k) g'(\boldsymbol u^*)}{\pi(\boldsymbol \theta_k) j(k|k^*) g(\boldsymbol u)} \left|\frac{\partial h'( \theta_{k^*}^*, \boldsymbol u^*)}{\partial (\boldsymbol \theta_k, \boldsymbol u)}\right|\right\} \]

    Where \(u^* \sim g'\) (Chen, Ibrahim, and Shao 2000, 303)

For Bayesian Model Selection

  • Assign each \(\mathcal{M}_k\) a prior probability \(p(k)\)
  • Target distribution becomes the posterior distribution of each model, \(\pi(\boldsymbol{ \theta_{k}}|D,\mathcal{M}_{k})\), where \(D\) is the data
  • Acceptance probability becomes

    \[ \alpha = \min\left\{1, \frac{p(k^*)\pi( \theta_{k^*}^*|D,\mathcal{M}_{k^*}) j(k^*| k) g'(\boldsymbol u^*)}{p(k)\pi(\boldsymbol \theta_k |D,\mathcal{M}_k ) j(k|k^*) g(\boldsymbol u)} \left|\frac{\partial h'( \theta_{k^*}^*, \boldsymbol u^*)}{\partial (\boldsymbol \theta_k, \boldsymbol u)}\right|\right\} \]

  • Posterior model probability is \(\hat{p}(k | D) = \frac{1}{L} \sum\limits_{l = 1}^L \mathcal{I}(k_l = k)\) where \(L\) is the length of the MCMC chain and \(k_l\) is the model chosen at iteration \(l\).

Example

Poisson vs Negative Binomial Model

(Green and Hastie 2009)

If count data are overdispersed relative to the Poisson distribution, the Negative-Binomial distribution may be more appropriate.

  • Soccer data: Total goals from 1,140 English Premier league games, 2005/2006 to 2007/2008 seasons
  • Is data better modeled using Poisson or Negative-Binomial distribution?

Under an identically distributed Poisson model, one parameter \(\lambda > 0\)

\[ \mathcal{L}(Y|\lambda) = \prod_{i=1}^N \dfrac{\lambda^{Y_i}}{Y_i!}e^{-\lambda} \]

Under an identically distributed Negative-Binomial model, two parameters \(\lambda > 0\), \(\kappa > 0\)

\[ \mathcal{L}(Y|\lambda, \kappa) = \prod_{i=1}^N \dfrac{\lambda^{Y_i}}{Y_i!} \dfrac{\Gamma \left(1/\kappa + Y_i\right)}{\Gamma \left(1/\kappa\right)\left(1/\kappa + \lambda\right)^{Y_i}}\left(1+\kappa \lambda\right)^{-1/\kappa} \]

The Problem Setup

Consider the model selection problem with \(\{\mathcal{M}_k: k = 1,2\}\)

  • Assume \(P(k=1) = P(k=2) = 0.5\)

Under Model \(k = 1\), \(\theta_1 = \lambda\)

  • \(Y_i | \lambda \sim \text{Poisson}(\lambda)\)
  • \(\lambda \sim \text{Gamma}(\alpha_{\lambda}, \beta_{\lambda})\)

Under Model \(k = 2\), \(\theta_2 = (\lambda, \kappa)\)

  • \(Y_i|\lambda,\kappa \sim \text{NegBin}(\lambda, \kappa)\)
  • \(\lambda, \kappa \sim \text{Gamma}(\alpha_{\lambda}, \beta_{\lambda}) \times \text{Gamma}(\alpha_{\kappa}, \beta_{\kappa})\)

Posterior Distribution

\[ \pi(k, \theta_k| Y) \propto \begin{cases} \frac{1}{2} \mathcal{L}(Y|\lambda) \text{Gamma}(\alpha_{\lambda}, \beta_{\lambda}) & k=1\\ \frac{1}{2} \mathcal{L}(Y|\lambda,\kappa) \text{Gamma}(\alpha_{\lambda}, \beta_{\lambda}) \text{Gamma}(\alpha_{\kappa}, \beta_{\kappa}) & k=2\\ \end{cases} \]

RJMCMC to move between models

Since the two models are of differing dimensions, need reversible jump to move between models.

Move from Model 1 to Model 2

  • Let \(g\) be the \(N(0,\sigma^2)\) density for fixed \(\sigma\). Generate \(u\) from \(g\).
  • Let \(x = (1, \theta_1)\). Set \(x' = (2, \theta_2')\), where

    \[ \theta_2' = (\theta_{2,1}', \theta_{2,2}') = h(\theta_1, u) = (\theta_1, \mu e^u) \]

    for fixed \(\mu\). Now \(\kappa = \mu e^u\).

Move from Model 2 to Model 1

  • Let \[ (\theta_1, u) = h'(\theta_2') = (\theta_{2,1}', log(\theta_{2,2}'/\mu)) \]
  • Doesn't require new random variable \(u'\) (i.e. dim(\(u'\))=0) and likewise no \(g'\).

Then \((x,u)\) has dimension 2 + 1 = 3, which matches dimension of \((x',u')\) = 3 + 0 = 3.

Acceptance probabilities

Let \(m=\left\{(1,2), (2,1)\right\}\).

  • \(\alpha_m(x,x') = \text{min}\left\{1, A_m(x, x') \right\}\) where \[ A_m(x,x') = \dfrac{\pi(x')j_m(x')g_m'(u')}{\pi(x)j_m(x)g_m(u)}\left\vert \dfrac{\partial(\theta_{k'}', u')}{\partial (\theta_k, u)} \right\vert\\ = \dfrac{\pi(x')}{\pi(x)g(u)}\left\vert \dfrac{\partial(\theta_{k'}', u')}{\partial (\theta_k, u)} \right\vert \]

Model 1 to Model 2

\[ |J| = \left\vert \begin{array}{cc} 1 & 0\\ 0 & \mu e^u\\ \end{array}\right\vert = \mu e^u \]

Model 2 to Model 1

\[ |J| = \left\vert \begin{array}{cc} 1 & 0\\ 0 & (\mu/\theta_{2,2}')\mu\\ \end{array}\right\vert = 1/\theta_{2,2}' \]

Acceptance probabilities, cont.

Move from Model 1 to Model 2

\(\alpha_{1,2}(x, x') = \text{min}\{1, A_{1,2}\}\) where \[ A_{1,2} = \dfrac{\pi(2, \theta_2')}{\pi(1, \theta_1)} \left\{\dfrac{1}{\sqrt{2\pi\sigma^2}} \text{exp} \left\lbrack \dfrac{-u^2}{2\sigma^2} \right\rbrack \right\}^{-1} \mu e^u \]

Move from Model 2 to Model 1

\(\alpha_{2,1}(x', x) = \text{min}\{1, A_{2,1}\}\) where \[ A_{2,1} = \dfrac{\pi(1, \theta_1)}{\pi(2, \theta_2')} \dfrac{1}{\sqrt{2\pi\sigma^2}} \text{exp} \left\lbrack \dfrac{-log(\theta_{2,2}'/\mu)^2}{2\sigma^2} \right\rbrack \frac{1}{\theta_{2,2}'} \]

The algorithm

Results

  • choose \(\mu = 0.015\), \(\sigma=1.5\), \(\alpha_{\lambda} = 25\), \(\beta_{\lambda}=10\), \(\alpha_{\kappa} = 1\), \(\beta_{\kappa}=10\).
  • 50,000 iterations with burn-in of 5,000
  • \(p(k=1|\theta, y) = 0.71\), \(p(k=2|\theta, y) = 0.29\)

Comments

  • Convergence of algorithm is dependent on choice of \(\mu\) and \(\sigma\)
  • Choice of \(g\) and \(h\) are not unique. e.g. \(u\sim \text{Gamma}\), \(h(\theta, u) = (\theta, u)\)

Challenges for implementation

The Challenges Aren't What You Think

Implementing reversible algorithms may seem difficult for several reasons.

  • Much of the work being reported is from the perspective of MCMC "experts"
  • The language required to present these samplers is necessarily complex

These issues are not the cause of difficulty though

  • In the practical sense, implementation is actually fairly easy
  • Very few areas require a detailed understanding of the underlying theorhetical framework
  • There is little justification needed to garuntee a sampler is able to simulate from a target

Efficiency

The main issue is usually not whether a proposal mechanism will work, but whether it will work effeciently.

  • Ineffecient chains are slow to explore the support of the target distribution
  • Taking more time to converge to the target distribution

The tuning for a specific problem can be arduous, which has lead to work on finding useful general techniques for selecting parts of the mechanism.

General Approaches to Improve Efficiency

Improving efficiency requires the proposed state \((k', \theta'_{k'})\) and the existitng state \((k, \theta_k)\) have similar state spaces.

There are two main classes of methods for ensuring this:

  • Order methods: parameterizing proposals (our \(g(u)\)) for a given \(h(\theta, u) = (\theta', u')\).
  • Saturated state: augment the state space \(\mathcal{X}\) with auxiliary variables

Order methods

  • For a given initial state \(\theta_k\) in model \(k\) we can find an equivalent state \(c_{k,k'}(\theta_k)\) in model \(k'\).
  • We call these equiavlent states are called "centering points"
  • If we constrain \(A((k,\theta_k), (k', c_{k,k'}(\theta_k)\) to be, say, 1, then moving from one model (\(k\)) to another (\(k'\)) will be encouraged and the state space will be more thoroughly explored.

The order of the method determines the type of constraint imposed:

  • \(0^{th}\)-order: \(A((k,\theta_k), (k', c_{k,k'}(\theta_k) = 1\)
  • \(k^{th}\)-order: \(\nabla A((k,\theta_k), (k', c_{k,k'}(\theta_k) = \mathbf{0}\)

Saturated State Space Approach

  • For a given state space \(\mathcal{X}\) add additional "auxiliary" variables so that each model has the same number of parameters as the largest model.
  • This means that changing between models with different numbers of parameters is equivalent to changing the auxiliary variables in or out
  • Between state changes become more likely meaning the sampler covers the set of possible proposals more quickly

Other Ways to Improve Proposals

  • Adaptive sampling: using past observations (even rejected ones) to make mid-run adjustments to the proposal

    • Type 1: diminishing - continuous adaptation, but at a decreasing rate
    • Type 2: through regeneration - if regions of the state space exists where incoming chains are likely independent of outgoing chains, adapt as chains enter and leave them
  • Delayed rejection: if proposal \(x'\) is rejected, try a backup proposal \(x''\)

Finding Appropriate Diagnostics

  • Goal of Diagnostics: provide clear rules for identifying good chain behavior
  • Problem: improving efficiency means promoting transitions between models
  • Between model transitions are problematic:
    • Chains may "stabilize" quickly inside a model but slowly between models
    • Diagnostics depend on model and transitions
  • Recent work: understand "within run" and "between run" variability
    • Accounting for how much disruption in chain behavior due to switching models

Beyond…

There's more?

Extensions to…

  • Adaptive RJMCMC (D. Hastie 2005)
  • Interacting Sequential MCMC (Jasra et al. 2008)
  • Simulated Annealing extended for trans-dimensionsal problems (S. Geman and Geman 1984)

Alternatives to…

for trans-dimensional problems

  • Jump diffusion (Grenander and Miller 1994)
  • Marked point processes (M. Stephens 2000)
  • Product Space approach (Carlin and Chib 1995)

Thanks!

Questions?

Appendix

Development of \(\alpha\)

Metropolis-Hastings on a general state space

To construct a Markov chain on a state space \(\mathcal{X}\) with invariant distribution \(\pi\). We will only consider reversible chains, so the transition kernel \(P\) satisfies the detailed balance condition

\[ \int\limits_{(x, x') \in A \times B} \pi(dx)P(x, dx') = \int\limits_{(x, x') \in A \times B} \pi(dx')P(x', dx) \]

for all Borel sets \(A,B \subset \mathcal{X}\). We make a transition by frawing a candidate new state \(x'\) from the proposal measure \(q(x, dx')\) and accepting it with probability \(\alpha(x, x')\). We we reject, we stay in the current state so that \(P(x, dx')\) has an element at \(x\). This constributes \(\int_{A\cap B} P(x, \{x\}) \pi(dx)\) to each side of the balance equation; subtracting leaves

\[ \int\limits_{(x, x') \in A \times B} \pi(dx)q(x, dx')\alpha(x, x') = \int\limits_{(x, x') \in A \times B} \pi(dx')q(x', dx)\alpha(x', x) \]

Now $ (dx)q(x, dx')$ is dominated by a symmetric measure \(\mu\) on \(\mathcal{X} \times \mathcal{X}\) and let its density wrt \(\mu\) be denoted \(f\). Then the above equality becomes

\[ \int\limits_{(x, x') \in A \times B} \alpha(x, x')f(x, x')\mu(dx, dx') = \int\limits_{(x, x') \in A \times B} \alpha(x', x)f(x', x)\mu(dx', dx) \]

and by the summetry of \(\mu\), this is satisfied for all Borel \(A,B\) if

\[ \alpha(x, x') = \min\left\{1, \frac{f(x', x)}{f(x, x')}\right\} = \min\left\{1, \frac{\pi(dx')q(x', dx)}{\pi(dx)q(x, dx')}\right\} \]

Development of \(\alpha\) Cont'd

Constructive representation in terms of random numbers

Let \(\mathcal{X} \subset \mathbb{R}^d\) and suppose \(\pi\) has a density wrt to the \(d\)-dimensional Lebesgue measure (also denoted \(\pi\)). At the current state \(x\), we generate \(r\) random numbers \(u\) from a known joint density \(g\) and then form the proposed new state as a deterministic function of the current state and the random numbers so that \(x' = h(x, u)\). Then the second equality on the previous slide can be written

\[ \int\limits_{(x, x') \in A \times B} \pi(x) g(u) \alpha(x, x') dx du = \int\limits_{(x, x') \in A \times B} \pi(x') g(u') \alpha(x', x) dx' du'. \]

Where the reverse transition from \(x'\) to \(x\) is made with the aid of random numbers \(u' \sim g'\) giving \(x = \mathcal{H}(x', u')\) and this transformation from \((x, u)\) to \((x', u')\) is a diffeomorphism. Then the \((d + r)\)-dimensional integral equality holds if

\[ \pi(x)g(u)\alpha(x, x') = \pi(x') g(u') \alpha(x', x) \left|\frac{\partial(x', u')}{\partial(x,u)}\right|. \]

Then, a valid choice for \(\alpha\) is

\[ \alpha(x, x') = \min\left\{1, \frac{\pi(x')g'(u')}{\pi(x)g(u)} \left|\frac{\partial(x', u')}{\partial(x,u)}\right|\right\} \]

(Green 2003)

References

Carlin, Bradley P, and Siddhartha Chib. 1995. “Bayesian Model Choice via Markov Chain Monte Carlo Methods.” Journal of the Royal Statistical Society. Series B (Methodological). JSTOR, 473–84.

Chen, Ming-Hui, Joseph G Ibrahim, and Qi-Man Shao. 2000. Monte Carlo Methods in Bayesian Computation. Springer.

Geman, Stuart, and Donald Geman. 1984. “Stochastic Relaxation, Gibbs Distributions, and the Bayesian Restoration of Images.” Pattern Analysis and Machine Intelligence, IEEE Transactions on, no. 6. IEEE: 721–41.

Green, Peter J. 2003. “Trans-Dimensional Markov Chain Monte Carlo.” Oxford Statistical Science Series. OXFORD UNIV PRESS, 179–98.

Green, Peter J, and David I Hastie. 2009. “Reversible Jump MCMC.” Genetics 155 (3): 1391–1403.

Grenander, Ulf, and Michael I Miller. 1994. “Representations of Knowledge in Complex Systems.” Journal of the Royal Statistical Society. Series B (Methodological). JSTOR, 549–603.

Hastie, David. 2005. “Towards Automatic Reversible Jump Markov Chain Monte Carlo.” PhD thesis, University of Bristol.

Jasra, Ajay, Arnaud Doucet, David A Stephens, and Christopher C Holmes. 2008. “Interacting Sequential Monte Carlo Samplers for Trans-Dimensional Simulation.” Computational Statistics & Data Analysis 52 (4). Elsevier: 1765–91.

Stephens, Matthew. 2000. “Bayesian Analysis of Mixture Models with an Unknown Number of Components-an Alternative to Reversible Jump Methods.” Annals of Statistics. JSTOR, 40–74.